Context-Based Chinese Word Segmentation using SVM Machine-Learning Algorithm without Dictionary Support
Authors
Abstract
This paper presents a new machine-learning approach to Chinese word segmentation (CWS) that treats CWS as a break-point classification problem, where a break point is the boundary between two consecutive words. The approach uses a support vector machine (SVM) classifier, which learns the segmentation rules of the Chinese language from a context model of the break points in a corpus. We also design an effective feature set for building the context model and a systematic procedure for creating the positive and negative samples used to train the classifier. Unlike traditional approaches, which require large-scale external knowledge sources such as dictionaries or linguistic tagging, the proposed approach uses the most frequent words in the corpus as its learning sources. As a result, CWS can be performed on any new corpus even when no suitable external resources are available. According to our experimental results, the proposed approach achieves results competitive with the Chinese Knowledge and Information Processing (CKIP) system from Academia Sinica.
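The break-point framing in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual feature set or sample-creation procedure, so everything below is an assumption made for illustration: the function name `break_point_samples`, the padding symbol, the two-character context window, and the use of a pre-segmented toy sentence as input (the paper instead draws its learning sources from the most frequent words in the corpus). Each gap between two adjacent characters becomes one sample, labeled positive if it is a word boundary and negative otherwise.

```python
from typing import Dict, List, Tuple

PAD = "#"  # hypothetical padding symbol for positions outside the sentence


def break_point_samples(
    words: List[str], window: int = 2
) -> List[Tuple[Dict[str, str], int]]:
    """Turn one pre-segmented sentence into labeled break-point samples.

    Every gap between two adjacent characters is a candidate break point.
    Label 1 means the gap is a word boundary; label 0 means it lies
    inside a word.
    """
    text = "".join(words)

    # Character offsets at which one word ends and the next begins.
    boundaries = set()
    offset = 0
    for word in words[:-1]:
        offset += len(word)
        boundaries.add(offset)

    padded = PAD * window + text + PAD * window
    samples = []
    for gap in range(1, len(text)):
        right = gap + window  # index in `padded` of the first character right of the gap
        features = {}
        for k in range(1, window + 1):
            features[f"L{k}"] = padded[right - k]      # k-th character to the left
            features[f"R{k}"] = padded[right + k - 1]  # k-th character to the right
        # Bigram that straddles the candidate break point.
        features["L1R1"] = padded[right - 1] + padded[right]
        samples.append((features, 1 if gap in boundaries else 0))
    return samples


if __name__ == "__main__":
    for feats, label in break_point_samples(["我们", "喜欢", "机器", "学习"]):
        print(label, feats)
```

The resulting feature dictionaries could then be one-hot encoded and passed to an SVM implementation of the reader's choice; the classifier's prediction for each gap decides whether a word boundary is placed there.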
Similar resources
Chinese Word Segmentation by Classification of Characters
During the process of Chinese word segmentation, two main problems occur: segmentation ambiguities and unknown word occurrences. This paper describes a method to solve the segmentation problem. First, we use a dictionary-based approach to segment the text. We apply the Maximum Matching algorithm to segment the text forwards (FMM) and backwards (BMM). Based on the difference between FMM and BMM,...
A Hybrid Algorithm based on Deep Learning and Restricted Boltzmann Machine for Car Semantic Segmentation from Unmanned Aerial Vehicles (UAVs)-based Thermal Infrared Images
Nowadays, ground vehicle monitoring (GVM) is one of the areas of application in the intelligent traffic control system using image processing methods. In this context, the use of unmanned aerial vehicles based on thermal infrared (UAV-TIR) images is one of the optimal options for GVM due to the suitable spatial resolution, cost-effective and low volume of images. The methods that have been prop...
Improving Chinese Word Segmentation by Adopting Self-Organized Maps of Character N-gram
Character-based tagging method has achieved great success in Chinese Word Segmentation (CWS). This paper proposes a new approach to improve the CWS tagging accuracy by combining Self-Organizing Map (SOM) with structured support vector machine (SVM) for utilization of enormous unlabeled text corpus. First, character N-grams are clustered and mapped into a low-dimensional space by adopting SOM al...
Unknown Word Identification for Chinese Morphological Analysis
Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese texts becomes an essential task for Chinese language processing. Besides word segmentation, we also need to identify the part-of-speech (POS) tags of the words. The segmentation and POS tagging process are denoted as morphological analysis. During the process of word segmentation, two main problems o...
High Speed Unknown Word Prediction Using Support Vector Machine for Chinese Text-to-Speech Systems
One of the most significant problems in POS (Part-of-Speech) tagging of Chinese texts is an identification of words in a sentence, since there is no blank to delimit the words. Because it is impossible to pre-register all the words in a dictionary, the problem of unknown words inevitably occurs during this process. Therefore, the unknown word problem has remarkable effects on the accuracy of th...
Publication date: 2013